{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "81141247-631a-4b2b-8822-20dff7735c76",
   "metadata": {},
   "source": [
    "# 系统化的数据问题诊断\n",
    "\n",
    "在进行数据清洗之前，我们首先需要对数据质量进行全面的诊断。这就像医生在制定治疗方案前，需要先进行全面的体检一样。一个系统化的数据审核流程不仅能帮助我们发现数据中的问题，还能帮助我们理解这些问题的性质、严重程度和潜在影响。\n",
    "\n",
    "例如，在处理一个电商平台的交易数据时，单纯发现某些订单金额为负数并不足够。我们需要理解：这是数据录入错误？还是退款记录？又或是系统计算误差？只有通过系统化的审核，我们才能做出准确的判断，并选择合适的处理方法。\n",
    "\n",
    "## 一、数据完整性审核：基础但关键的第一步\n",
    "\n",
    "### 完整性审核的重要性\n",
    "\n",
    "数据完整性审核是整个数据质量评估的基础。想象你在分析用户行为数据，如果关键的用户ID或行为时间戳缺失，那么后续的分析就会变得毫无意义。更糟糕的是，如果我们没有及时发现并处理这些缺失值，可能会导致分析结果产生严重偏差。\n",
    "\n",
    "### 完整性审核的层次\n",
    "\n",
    "完整性审核需要从多个层次展开：\n",
    "\n",
    "1. **系统层面的缺失**\n",
    "   - NULL值：数据库中的真实空值\n",
    "   - NaN值：Python/Pandas中的特殊空值\n",
    "   - 空字符串：看似有值但实际为空的情况\n",
    "\n",
    "2. **业务层面的缺失**\n",
    "   - \"N/A\"、\"未知\"等表示缺失的文本\n",
    "   - 0或-1等特殊值表示的缺失\n",
    "   - 明显不合理的默认值（如1900-01-01这样的日期）\n",
    "\n",
    "3. **上下文层面的缺失**\n",
    "   - 必填字段的缺失（如订单必须有订单号）\n",
    "   - 条件性必填字段的缺失（如选择\"其他\"时必须填写备注）\n",
    "   - 关联数据的缺失（如有订单号但无对应的订单详情）\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afee0944-7a55-4266-9027-89fd9611e59e",
   "metadata": {},
   "source": [
    "## Pandas审核核心函数详解\n",
    "\n",
    "### 1. isnull() 与缺失值检测\n",
    "\n",
    "`isnull()`是Pandas中最基础的缺失值检测函数。它的工作原理是遍历DataFrame中的每个元素，检查是否为缺失值（NULL或NaN）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1ab17556-e657-4133-a901-3671bc1c3728",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== 基础缺失值检查 ===\n",
      "      列1     列2\n",
      "0  False  False\n",
      "1   True  False\n",
      "2   True   True\n",
      "3  False  False\n",
      "\n",
      "=== 每列缺失值统计 ===\n",
      "列1    2\n",
      "列2    1\n",
      "dtype: int64\n",
      "\n",
      "=== 缺失值比例 ===\n",
      "列1    50.0\n",
      "列2    25.0\n",
      "dtype: float64 %\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# 创建示例数据\n",
    "data = {\n",
    "    '列1': [1, None, np.nan, 4],\n",
    "    '列2': ['a', '', None, 'd']\n",
    "}\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "# 基础检查\n",
    "print(\"=== 基础缺失值检查 ===\")\n",
    "print(df.isnull())  # 返回布尔矩阵\n",
    "\n",
    "# 按列统计缺失值\n",
    "print(\"\\n=== 每列缺失值统计 ===\")\n",
    "print(df.isnull().sum())\n",
    "\n",
    "# 缺失值比例\n",
    "print(\"\\n=== 缺失值比例 ===\")\n",
    "print((df.isnull().sum() / len(df) * 100).round(2), '%')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "717a1fea-b18e-4db4-9ba2-ed90f6f88229",
   "metadata": {},
   "source": [
    "### 2. sum() 与数据汇总\n",
    "\n",
    "`sum()`函数不仅可以计算数值的总和，还可以配合布尔值进行计数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1edeaa30-9a1e-4c83-8417-4517a3cb75a8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== 数值汇总 ===\n",
      "总销售额： 1100.0\n",
      "总数量： 12.0\n",
      "\n",
      "=== 条件计数 ===\n",
      "高额销售数： 2\n"
     ]
    }
   ],
   "source": [
    "# 数值汇总\n",
    "print(\"=== 数值汇总 ===\")\n",
    "numeric_data = pd.DataFrame({\n",
    "    '销售额': [100, 200, 300, None, 500],\n",
    "    '数量': [1, 2, None, 4, 5]\n",
    "})\n",
    "\n",
    "print(\"总销售额：\", numeric_data['销售额'].sum())\n",
    "print(\"总数量：\", numeric_data['数量'].sum())\n",
    "\n",
    "# 条件计数\n",
    "print(\"\\n=== 条件计数 ===\")\n",
    "# 计算大于200的销售额数量\n",
    "print(\"高额销售数：\", (numeric_data['销售额'] > 200).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c67d7811-bc38-4c81-be9f-96f84db5f257",
   "metadata": {},
   "source": [
    "### 3. select_dtypes() 与数据类型筛选\n",
    "\n",
    "这个函数允许我们根据数据类型选择特定的列，这在处理大型数据集时特别有用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "46e22e02-4a4c-4067-8c52-bcd02dc26971",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== 数值类型列 ===\n",
      "   整数   浮点\n",
      "0   1  1.1\n",
      "1   2  2.2\n",
      "2   3  3.3\n",
      "\n",
      "=== 非数值类型列 ===\n",
      "  文本         日期\n",
      "0  a 2024-01-01\n",
      "1  b 2024-01-02\n",
      "2  c 2024-01-03\n"
     ]
    }
   ],
   "source": [
    "# 创建混合类型数据\n",
    "mixed_data = pd.DataFrame({\n",
    "    '整数': [1, 2, 3],\n",
    "    '浮点': [1.1, 2.2, 3.3],\n",
    "    '文本': ['a', 'b', 'c'],\n",
    "    '日期': pd.date_range('2024-01-01', periods=3)\n",
    "})\n",
    "\n",
    "# 选择数值类型列\n",
    "print(\"=== 数值类型列 ===\")\n",
    "numeric_cols = mixed_data.select_dtypes(include=['int64', 'float64'])\n",
    "print(numeric_cols)\n",
    "\n",
    "# 选择非数值类型列\n",
    "print(\"\\n=== 非数值类型列 ===\")\n",
    "non_numeric_cols = mixed_data.select_dtypes(exclude=['int64', 'float64'])\n",
    "print(non_numeric_cols)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13b29cab-0123-4546-accc-70c523a1af53",
   "metadata": {},
   "source": [
    "### 4. str.strip() 与字符串处理\n",
    "\n",
    "处理文本数据中的空格和特殊字符是数据清洗的重要部分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3337464f-437c-43dc-b9a1-3b7359216c1a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== 字符串清理 ===\n",
      "原始数据：\n",
      "0     abc \n",
      "1    def  \n",
      "2      ghi\n",
      "3     jkl \n",
      "dtype: object\n",
      "\n",
      "清理后：\n",
      "0    abc\n",
      "1    def\n",
      "2    ghi\n",
      "3    jkl\n",
      "dtype: object\n",
      "\n",
      "=== 空字符串检查 ===\n",
      "空字符串数量： 0\n"
     ]
    }
   ],
   "source": [
    "# 文本清理示例\n",
    "text_data = pd.Series([' abc ', 'def  ', '  ghi', ' jkl '])\n",
    "\n",
    "print(\"=== 字符串清理 ===\")\n",
    "print(\"原始数据：\")\n",
    "print(text_data)\n",
    "print(\"\\n清理后：\")\n",
    "print(text_data.str.strip())\n",
    "\n",
    "# 检查空字符串\n",
    "print(\"\\n=== 空字符串检查 ===\")\n",
    "has_empty = text_data.str.strip().eq('')\n",
    "print(\"空字符串数量：\", has_empty.sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43d24920-39f5-4353-a33c-0ae9071c8efc",
   "metadata": {},
   "source": [
    "### 5. astype() 与类型转换\n",
    "\n",
    "数据类型转换是数据预处理的关键步骤。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9b4218d4-dc19-404e-af2a-7f020a1c97a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== 类型转换 ===\n",
      "原始类型：\n",
      "数值字符串    object\n",
      "分类数据     object\n",
      "dtype: object\n",
      "\n",
      "转换后类型\n"
     ]
    }
   ],
   "source": [
    "# 类型转换示例\n",
    "data = pd.DataFrame({\n",
    "    '数值字符串': ['1', '2', '3'],\n",
    "    '分类数据': ['A', 'B', 'A']\n",
    "})\n",
    "\n",
    "print(\"=== 类型转换 ===\")\n",
    "print(\"原始类型：\")\n",
    "print(data.dtypes)\n",
    "\n",
    "# 转换为数值\n",
    "data['数值字符串'] = data['数值字符串'].astype(int)\n",
    "# 转换为分类\n",
    "data['分类数据'] = data['分类数据'].astype('category')\n",
    "\n",
    "print(\"\\n转换后类型\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ecbf53d-4307-4ebd-a8cb-baa3b6c018ee",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
